Markdown comments noted by the student/author (John Leonard) are highlighted in red. The final section of the document (Section 7) contains the informal written report.
FROM: Danielle Sherman
Subject: Brand Preference Prediction

Hello,
The sales team has again consulted with me with some concerns about ongoing product sales in one of our stores. Specifically, they have been tracking the sales performance of specific product types and would like us to redo our previous sales prediction analysis, but this time they’d like us to include the ‘product type’ attribute in our predictions to better understand how specific product types perform against each other. They have asked our team to analyze historical sales data and then make sales volume predictions for a list of new product types, some of which are also from a previous task. This will help the sales team better understand how types of products might impact sales across the enterprise.
I have attached historical sales data and new product data sets to this email. I would like for you to do the analysis with the goals of:
- Predicting sales of four different product types: PC, Laptops, Netbooks, and Smartphones
- Assessing the impact service reviews and customer reviews have on sales of the different product types

When you have completed your analysis, please submit a brief report that includes the methods you employed and your results. I would also like to see the results exported from R for each of the methods.
Thanks,
Danielle
Attachments: existingproductattributes2017.csv, newproductattributes2017.csv
You have been asked by Danielle Sherman, CTO of Blackwell Electronics, to predict the sales in four different product types while assessing the effects service and customer reviews have on sales. You'll be using Regression to build machine learning models for this analysis using a choice of two of three popular algorithms. Once you have determined which one works better on the provided data set, Danielle would like you to predict the sales of four product types from the new products list and prepare a report of your findings.
This task requires you to prepare one deliverable for Danielle Sherman:
Sales Prediction Report. A report in a Zip file that includes:
#Load the libraries & set random seed
library(caret)
library(readr)
set.seed(1)
#load the data set
df <- read_csv("existingproductattributes2017.csv");
Parsed with column specification:
cols(
  ProductType = col_character(),
  ProductNum = col_double(),
  Price = col_double(),
  x5StarReviews = col_double(),
  x4StarReviews = col_double(),
  x3StarReviews = col_double(),
  x2StarReviews = col_double(),
  x1StarReviews = col_double(),
  PositiveServiceReview = col_double(),
  NegativeServiceReview = col_double(),
  Recommendproduct = col_double(),
  BestSellersRank = col_double(),
  ShippingWeight = col_double(),
  ProductDepth = col_double(),
  ProductWidth = col_double(),
  ProductHeight = col_double(),
  ProfitMargin = col_double(),
  Volume = col_double()
)
In previous Regression tasks, you needed to remove non-numeric features to make predictions; however, typical datasets don’t contain only numeric values. Most data will contain a mixture of numeric and nominal data so we need to understand how to incorporate both when it comes to developing regression models and making predictions.
Categorical variables may be used directly as predictor or predicted variables in a multiple regression model as long as they’ve been converted to binary values. In order to pre-process the sales data as needed we first need to convert all factor or ‘chr’ classes to binary features that contain ‘0’ and ‘1’ classes. Fortunately, caret has a method for creating these ‘Dummy Variables’ as follows:
# one-hot encode (dummify) the data
df_preprocessed <- dummyVars(" ~.",data = df)
df_preprocessed <- data.frame(predict(df_preprocessed,newdata = df))
colnames(df_preprocessed)
[1] "ProductTypeAccessories" "ProductTypeDisplay"
[3] "ProductTypeExtendedWarranty" "ProductTypeGameConsole"
[5] "ProductTypeLaptop" "ProductTypeNetbook"
[7] "ProductTypePC" "ProductTypePrinter"
[9] "ProductTypePrinterSupplies" "ProductTypeSmartphone"
[11] "ProductTypeSoftware" "ProductTypeTablet"
[13] "ProductNum" "Price"
[15] "x5StarReviews" "x4StarReviews"
[17] "x3StarReviews" "x2StarReviews"
[19] "x1StarReviews" "PositiveServiceReview"
[21] "NegativeServiceReview" "Recommendproduct"
[23] "BestSellersRank" "ShippingWeight"
[25] "ProductDepth" "ProductWidth"
[27] "ProductHeight" "ProfitMargin"
[29] "Volume"
Correlation, as you likely already know, is a measure of the relationship between two or more features or variables. In this problem, you were tasked with ascertaining whether specific features have an impact on sales volume.
str(df_preprocessed) #Check data structure
'data.frame': 80 obs. of 29 variables:
$ ProductTypeAccessories : num 0 0 0 0 0 1 1 1 1 1 ...
$ ProductTypeDisplay : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeExtendedWarranty: num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeGameConsole : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeLaptop : num 0 0 0 1 1 0 0 0 0 0 ...
$ ProductTypeNetbook : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypePC : num 1 1 1 0 0 0 0 0 0 0 ...
$ ProductTypePrinter : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypePrinterSupplies : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeSmartphone : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeSoftware : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductTypeTablet : num 0 0 0 0 0 0 0 0 0 0 ...
$ ProductNum : num 101 102 103 104 105 106 107 108 109 110 ...
$ Price : num 949 2250 399 410 1080 ...
$ x5StarReviews : num 3 2 3 49 58 83 11 33 16 10 ...
$ x4StarReviews : num 3 1 0 19 31 30 3 19 9 1 ...
$ x3StarReviews : num 2 0 0 8 11 10 0 12 2 1 ...
$ x2StarReviews : num 0 0 0 3 7 9 0 5 0 0 ...
$ x1StarReviews : num 0 0 0 9 36 40 1 9 2 0 ...
$ PositiveServiceReview : num 2 1 1 7 7 12 3 5 2 2 ...
$ NegativeServiceReview : num 0 0 0 8 20 5 0 3 1 0 ...
$ Recommendproduct : num 0.9 0.9 0.9 0.8 0.7 0.3 0.9 0.7 0.8 0.9 ...
$ BestSellersRank : num 1967 4806 12076 109 268 ...
$ ShippingWeight : num 25.8 50 17.4 5.7 7 1.6 7.3 12 1.8 0.75 ...
$ ProductDepth : num 23.9 35 10.5 15 12.9 ...
$ ProductWidth : num 6.62 31.75 8.3 9.9 0.3 ...
$ ProductHeight : num 16.9 19 10.2 1.3 8.9 ...
$ ProfitMargin : num 0.15 0.25 0.08 0.08 0.09 0.05 0.05 0.05 0.05 0.05 ...
$ Volume : num 12 8 12 196 232 332 44 132 64 40 ...
summary(df_preprocessed)
ProductTypeAccessories ProductTypeDisplay ProductTypeExtendedWarranty
Min. :0.000 Min. :0.0000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.000 Median :0.0000 Median :0.000
Mean :0.325 Mean :0.0625 Mean :0.125
3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.000
Max. :1.000 Max. :1.0000 Max. :1.000
ProductTypeGameConsole ProductTypeLaptop ProductTypeNetbook ProductTypePC
Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.00
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00
Median :0.000 Median :0.0000 Median :0.000 Median :0.00
Mean :0.025 Mean :0.0375 Mean :0.025 Mean :0.05
3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.00
Max. :1.000 Max. :1.0000 Max. :1.000 Max. :1.00
ProductTypePrinter ProductTypePrinterSupplies ProductTypeSmartphone
Min. :0.00 Min. :0.0000 Min. :0.00
1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.00
Median :0.00 Median :0.0000 Median :0.00
Mean :0.15 Mean :0.0375 Mean :0.05
3rd Qu.:0.00 3rd Qu.:0.0000 3rd Qu.:0.00
Max. :1.00 Max. :1.0000 Max. :1.00
ProductTypeSoftware ProductTypeTablet ProductNum Price
Min. :0.000 Min. :0.0000 Min. :101.0 Min. : 3.60
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:120.8 1st Qu.: 52.66
Median :0.000 Median :0.0000 Median :140.5 Median : 132.72
Mean :0.075 Mean :0.0375 Mean :142.6 Mean : 247.25
3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:160.2 3rd Qu.: 352.49
Max. :1.000 Max. :1.0000 Max. :200.0 Max. :2249.99
x5StarReviews x4StarReviews x3StarReviews x2StarReviews
Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 10.0 1st Qu.: 2.75 1st Qu.: 2.00 1st Qu.: 1.00
Median : 50.0 Median : 22.00 Median : 7.00 Median : 3.00
Mean : 176.2 Mean : 40.20 Mean : 14.79 Mean : 13.79
3rd Qu.: 306.5 3rd Qu.: 33.00 3rd Qu.: 11.25 3rd Qu.: 7.00
Max. :2801.0 Max. :431.00 Max. :162.00 Max. :370.00
x1StarReviews PositiveServiceReview NegativeServiceReview Recommendproduct
Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.100
1st Qu.: 2.00 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.:0.700
Median : 8.50 Median : 5.50 Median : 3.000 Median :0.800
Mean : 37.67 Mean : 51.75 Mean : 6.225 Mean :0.745
3rd Qu.: 15.25 3rd Qu.: 42.00 3rd Qu.: 6.250 3rd Qu.:0.900
Max. :1654.00 Max. :536.00 Max. :112.000 Max. :1.000
BestSellersRank ShippingWeight ProductDepth ProductWidth
Min. : 1 Min. : 0.0100 Min. : 0.000 Min. : 0.000
1st Qu.: 7 1st Qu.: 0.5125 1st Qu.: 4.775 1st Qu.: 1.750
Median : 27 Median : 2.1000 Median : 7.950 Median : 6.800
Mean : 1126 Mean : 9.6681 Mean : 14.425 Mean : 7.819
3rd Qu.: 281 3rd Qu.:11.2050 3rd Qu.: 15.025 3rd Qu.:11.275
Max. :17502 Max. :63.0000 Max. :300.000 Max. :31.750
NA's :15
ProductHeight ProfitMargin Volume
Min. : 0.000 Min. :0.0500 Min. : 0
1st Qu.: 0.400 1st Qu.:0.0500 1st Qu.: 40
Median : 3.950 Median :0.1200 Median : 200
Mean : 6.259 Mean :0.1545 Mean : 705
3rd Qu.:10.300 3rd Qu.:0.2000 3rd Qu.: 1226
Max. :25.800 Max. :0.4000 Max. :11204
# drop columns that contain NAs
drops <- c("BestSellersRank")
df_preprocessed <- df_preprocessed[,!(names(df_preprocessed) %in% drops)]
names(df_preprocessed)
[1] "ProductTypeAccessories" "ProductTypeDisplay"
[3] "ProductTypeExtendedWarranty" "ProductTypeGameConsole"
[5] "ProductTypeLaptop" "ProductTypeNetbook"
[7] "ProductTypePC" "ProductTypePrinter"
[9] "ProductTypePrinterSupplies" "ProductTypeSmartphone"
[11] "ProductTypeSoftware" "ProductTypeTablet"
[13] "ProductNum" "Price"
[15] "x5StarReviews" "x4StarReviews"
[17] "x3StarReviews" "x2StarReviews"
[19] "x1StarReviews" "PositiveServiceReview"
[21] "NegativeServiceReview" "Recommendproduct"
[23] "ShippingWeight" "ProductDepth"
[25] "ProductWidth" "ProductHeight"
[27] "ProfitMargin" "Volume"
df_corr <-cor(df_preprocessed)
Correlation values fall between -1 and 1, with strongly positively related variables having correlation values closer to 1 and strongly negatively related variables having values closer to -1. What kind of relationship do two variables with a correlation of ‘0’ have?
A correlation of 0 corresponds to no linear relationship between the two columns being correlated.
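A quick toy illustration of this (hypothetical data, not part of the assignment): two independently drawn random variables should have a correlation near zero.

```r
set.seed(42)
a <- rnorm(1000)  # two independently drawn variables
b <- rnorm(1000)
cor(a, b)  # approximately 0: no linear relationship
```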
It is often very helpful to visualize the correlation matrix with a heat map so we can ‘see’ the impact different variables have on one another. To generate a heat map for your correlation matrix we’ll use corrplot package as follows:
#install.packages("corrplot")
library(corrplot)
corrplot(df_corr,order="hclust",tl.col="black", tl.srt=90,tl.cex = .45)
Blue (cooler) colors show a positive relationship and red (warmer) colors indicate a more negative relationship. Knowing what you do about correlation, what do you think intersections in the chart without colors represent?
A correlation of 0 corresponds to no linear relationship between the two columns being correlated.
Using the heat map, review the service and customer review relationships with sales volume and note the associated correlations for your report. If you would like more detailed correlation figures than those available with the heat map, enter the name of your correlation object into console and review the printed information.
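One convenient way to pull those detailed figures is to extract and sort the Volume column of the correlation matrix created above (a sketch using the existing `df_corr` object):

```r
# sketch: correlations of every feature with Volume, sorted for easy reading
sort(df_corr[, "Volume"], decreasing = TRUE)
```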
Now that you know the relationships between all of the variables in the data it is a good time to remove any features that aren’t needed for your analysis.
# Search for cross correlations > 0.95 and < 1 that aren't related to the label_column ("Volume")
label_column <- "Volume"
drops <- c(label_column)
df_corr_abs <- abs(df_corr)
df_corr_abs <- df_corr_abs[,!(colnames(df_corr_abs) %in% drops)] # drop the label column so you don't remove features correlated with the label
for (col_name in c(colnames(df_corr_abs))){
df_column <-df_corr_abs[,col_name]
df_strong_corr<-df_column[df_column>0.95]
if (length(df_strong_corr)>0) {
print(col_name)
print(df_strong_corr)
}
}
[1] "ProductTypeAccessories"
ProductTypeAccessories
1
[1] "ProductTypeDisplay"
ProductTypeDisplay
1
[1] "ProductTypeExtendedWarranty"
ProductTypeExtendedWarranty
1
[1] "ProductTypeGameConsole"
ProductTypeGameConsole
1
[1] "ProductTypeLaptop"
ProductTypeLaptop
1
[1] "ProductTypeNetbook"
ProductTypeNetbook
1
[1] "ProductTypePC"
ProductTypePC
1
[1] "ProductTypePrinter"
ProductTypePrinter
1
[1] "ProductTypePrinterSupplies"
ProductTypePrinterSupplies
1
[1] "ProductTypeSmartphone"
ProductTypeSmartphone
1
[1] "ProductTypeSoftware"
ProductTypeSoftware
1
[1] "ProductTypeTablet"
ProductTypeTablet
1
[1] "ProductNum"
ProductNum
1
[1] "Price"
Price
1
[1] "x5StarReviews"
x5StarReviews Volume
1 1
[1] "x4StarReviews"
x4StarReviews
1
[1] "x3StarReviews"
x3StarReviews
1
[1] "x2StarReviews"
x2StarReviews x1StarReviews
1.000000 0.951913
[1] "x1StarReviews"
x2StarReviews x1StarReviews
0.951913 1.000000
[1] "PositiveServiceReview"
PositiveServiceReview
1
[1] "NegativeServiceReview"
NegativeServiceReview
1
[1] "Recommendproduct"
Recommendproduct
1
[1] "ShippingWeight"
ShippingWeight
1
[1] "ProductDepth"
ProductDepth
1
[1] "ProductWidth"
ProductWidth
1
[1] "ProductHeight"
ProductHeight
1
[1] "ProfitMargin"
ProfitMargin
1
# Delete one of the pairs from each correlation
drops <- c("x1StarReviews","x5StarReviews")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)]
#Transform Volume to log10_Volume to prevent predictions of <0 volume
df_preprocessed<-df_preprocessed[!(df_preprocessed$Volume==0),] # drop zero-volume rows before taking log10
df_preprocessed['log10_Volume'] <- log10(df_preprocessed$Volume)
drops <- c("Volume")
df_preprocessed<-df_preprocessed[,!(colnames(df_preprocessed) %in% drops)]
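Note that any predictions made on the log10 scale must be back-transformed with `10^x` before reporting volumes; a minimal sketch with made-up numbers:

```r
# hypothetical model outputs on the log10 scale (illustrative values only)
log10_preds <- c(1.2, 2.5, 0.3)
volume_preds <- 10^log10_preds  # back-transform; 10^x is always positive,
                                # so negative volume predictions are impossible
```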
#Visualize the data
plot_summary_of_data<-function(DatasetName,x_index=1){
column_names = names(DatasetName)
subplot_cols = 2
subplot_rows = 2
par(mfrow=c(subplot_rows,subplot_cols))
x <- unlist(DatasetName[,x_index])
x_header = column_names[x_index]
for(i in 1:length(column_names)){
if(i != x_index) {
y <- unlist(DatasetName[,i])
y_header = column_names[i]
try(plot(x,y, xlab = x_header, ylab = y_header),silent=TRUE) #Scatter (Box) Plot
}
}
}
plot_summary_of_data(df_preprocessed,x_index=26)
In this step you will build models, make predictions and learn which algorithms are appropriate for parametric and non-parametric data sets.
#set up seed for reproducibility
set.seed(1)
# Define Label
y <- df_preprocessed$log10_Volume
#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]
y_train = df_train$log10_Volume
y_test = df_test$log10_Volume
#check dimensions of train & test set
dim(df_train); dim(df_test);
[1] 58 26
[1] 19 26
View(df_train)
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(log10_Volume ~., data = df_train, method = "lm", trControl=train_controls)
Warning: prediction from a rank-deficient fit may be misleading (repeated 30 times)
print(model)
Linear Regression
58 samples
25 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 52, 51, 53, 51, 53, 52, ...
Resampling results:
RMSE Rsquared MAE
0.7816035 0.6272193 0.5384831
Tuning parameter 'intercept' was held constant at a value of TRUE
cat('\n df_train post resample: \n')
df_train post resample:
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
prediction from a rank-deficient fit may be misleading
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.2491830 0.8933307 0.1890489
cat('\n df_test post resample: \n')
df_test post resample:
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
prediction from a rank-deficient fit may be misleading
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
1.36678107 0.02705989 0.95938732
1. What do you notice about the RMSE and R-squared values? On the training set, the RMSE and R-squared are quite good, but the values are poor for the testing set, suggesting the model is overfitting. Furthermore, multiple rank-deficiency warnings were thrown during the fit.
2. Did the model perform well? Why or why not? No; R-squared is very low for the testing set.
3. If not, perhaps you used the wrong type of machine learning method on the wrong type of data. See the following resource for more information: Parametric vs non-parametric methods for data analysis
So let’s dive into using some non-parametric machine learning models:
Using the same general approach documented in the walkthrough and the steps outlined below, make sales volume predictions on the new products dataset after training and testing your models on the historical data set:
#set up seed for reproducibility
set.seed(1)
# Define Label
y <- df_preprocessed$log10_Volume
#define a 75-25% train-test split of the dataset
inTraining <- createDataPartition(y, p = .75, list = FALSE)
df_train <- df_preprocessed[inTraining,]
df_test <- df_preprocessed[-inTraining,]
y_train = df_train$log10_Volume
y_test = df_test$log10_Volume
#check dimensions of train & test set
dim(df_train); dim(df_test);
[1] 58 26
[1] 19 26
Use the following 3 algorithms for your analysis; you might have to research each of these as there are variants of each in caret - you may choose which variant you need:
Support Vector Machine (SVM)
<span style='color:red'> [walkthrough link](http://dataaspirant.com/2017/01/19/support-vector-machine-classifier-implementation-r-caret-package/) </span>

train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
#View(df_train)
model <- train(log10_Volume ~., data = df_train, method = "svmLinear",
trControl=train_controls,
tuneLength = 10)
Warning: Variable(s) `' constant. Cannot scale data. (repeated 6 times)
model_svmLinear <- model
print(model)
Support Vector Machines with Linear Kernel
58 samples
25 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 52, 51, 52, 53, 54, 53, ...
Resampling results:
RMSE Rsquared MAE
1.299165 0.5567494 0.7928724
Tuning parameter 'C' was held constant at a value of 1
cat('\n df_train post resample: \n')
df_train post resample:
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.3243008 0.8199906 0.2046974
cat('\n df_test post resample: \n')
df_test post resample:
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.9348012 0.1020904 0.7404686
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
model <- train(log10_Volume ~., data = df_train, method = "rf",
trControl=train_controls,
tuneLength = 10)
model_rf <- model
print(model)
Random Forest
58 samples
25 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 52, 52, 52, 52, 52, 52, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.2929948 0.9109390 0.2245404
4 0.2299453 0.9303843 0.1769531
7 0.2049799 0.9385415 0.1530137
9 0.2010769 0.9413503 0.1495050
12 0.2004596 0.9409190 0.1518874
14 0.2001748 0.9396720 0.1520021
17 0.1982224 0.9403279 0.1500803
19 0.1982045 0.9416796 0.1505520
22 0.1973757 0.9418469 0.1510513
25 0.2034560 0.9371050 0.1547379
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 22.
cat('\n df_train post resample: \n')
df_train post resample:
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.08523613 0.98868133 0.05978294
cat('\n df_test post resample: \n')
df_test post resample:
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.2852245 0.8933604 0.2039424
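To understand which features drive the random forest's predictions, caret's `varImp()` can be inspected (a sketch; this output was not shown in the original analysis):

```r
# sketch: variable importance for the fitted random forest
print(varImp(model_rf))
plot(varImp(model_rf), top = 10)  # plot the ten most important features
```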
train_controls <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
#View(df_train)
model <- train(log10_Volume ~., data = df_train, method = "xgbTree",
trControl=train_controls,
tuneLength = 10)
model_gbTree <- model
print(model)
eXtreme Gradient Boosting
58 samples
25 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 52, 53, 52, 52, 53, 53, ...
Resampling results across tuning parameters:
  eta  max_depth  colsample_bytree  subsample   nrounds  RMSE       Rsquared   MAE
  0.3  1          0.6               0.5000000    50      0.2284764  0.9216587  0.1730951
  0.3  1          0.6               0.5000000   100      0.2201977  0.9261489  0.1602175
  0.3  1          0.6               0.5000000   150      0.2143680  0.9272393  0.1568886
  0.3  1          0.6               0.5000000   200      0.2135276  0.9295991  0.1542488
  0.3  1          0.6               0.5000000   250      0.2130879  0.9301313  0.1542443
  0.3  1          0.6               0.5000000   300      0.2130536  0.9302487  0.1557025
  0.3  1          0.6               0.5000000   350      0.2120491  0.9306822  0.1551318
  0.3  1          0.6               0.5000000   400      0.2133532  0.9303621  0.1569120
  0.3  1          0.6               0.5000000   450      0.2134724  0.9298110  0.1572759
  0.3  1          0.6               0.5000000   500      0.2139841  0.9302042  0.1577505
  [ output truncated -- the full tuning grid spans several thousand rows ]
Tuning parameter 'gamma' was held constant at a value of 0
Tuning
parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 10, eta =
0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
= 0.6111111.
cat('\n df_train post resample: \n')
df_train post resample:
df_train_test = df_train
y_train_test=y_train
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.004438720 0.999970293 0.002037815
cat('\n df_test post resample: \n')
df_test post resample:
df_train_test = df_test
y_train_test=y_test
prediction <- predict(model, df_train_test)
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
RMSE Rsquared MAE
0.3040302 0.8883087 0.2165503
Be sure to take any precautions needed to guard against overfitting and long training times.
Review your models and identify the one that performed best without overfitting. You should also look at the predicted values themselves. If you have negative values in your predictions and negative values are not possible for your dependent variable, choose a different model. Be prepared to explain why you chose to use the algorithms you did in your report.
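A quick sanity check along those lines (a sketch assuming the model objects fit above; because we trained on log10_Volume, back-transformed volumes are guaranteed positive, so this check mainly matters for models trained on the raw scale):

```r
# sketch: verify back-transformed volume predictions are non-negative
for (m in list(model_svmLinear, model_rf, model_gbTree)) {
  vol_preds <- 10^predict(m, df_test)  # back-transform from the log10 scale
  cat(m$method, "- negative volume predictions:", sum(vol_preds < 0), "\n")
}
```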
compare_models <- function (model_list, df_train, df_test, label_column){
for (i in 1:length(model_list)){
model <- model_list[i]
model_name <- list(model[[1]]$method)[[1]]
cat(paste('\n ----- model_name:',model_name,'-----'))
cat('\n df_train post resample: \n')
df_train_test = df_train
y_train_test = df_train[,c(label_column)]
prediction <- predict(model, df_train_test)[[1]]
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
plot(y_train_test,prediction,xlab = 'Label',ylab = 'Prediction', col='blue', main = model_name)
par(new=FALSE)
cat('\n df_test post resample: \n')
df_train_test = df_test
y_train_test = df_test[,c(label_column)]
prediction <- predict(model, df_train_test)[[1]]
df_postResample<-postResample(pred = prediction, obs = y_train_test)
print(df_postResample)
points(y_train_test,prediction,xlab = 'Label',ylab = 'Prediction', col='red')
legend('bottomright', legend = c('Train','Test'), col = c('blue','red'), pch = 'o')
}
}
model_list <- list(model_svmLinear, model_rf, model_gbTree)
compare_models(model_list, df_train, df_test, label_column = 'log10_Volume')
----- model_name: svmLinear -----
df_train post resample:
RMSE Rsquared MAE
0.3243008 0.8199906 0.2046974
df_test post resample:
RMSE Rsquared MAE
0.9348012 0.1020904 0.7404686
----- model_name: rf -----
df_train post resample:
RMSE Rsquared MAE
0.08523613 0.98868133 0.05978294
df_test post resample:
RMSE Rsquared MAE
0.2852245 0.8933604 0.2039424
----- model_name: xgbTree -----
df_train post resample:
RMSE Rsquared MAE
0.004438720 0.999970293 0.002037815
df_test post resample:
RMSE Rsquared MAE
0.3040302 0.8883087 0.2165503
The random forest and xgbTree appear to be the best models; however, the xgbTree seems to be overfitting, based on its R-squared score and the nearly perfect linear correlation between the label and the prediction in the label vs. prediction plot, so we will proceed with the random forest as the best model
#load the data set
df_validation <- read_csv("newproductattributes2017.csv");
Parsed with column specification:
cols(
ProductType = col_character(),
ProductNum = col_double(),
Price = col_double(),
x5StarReviews = col_double(),
x4StarReviews = col_double(),
x3StarReviews = col_double(),
x2StarReviews = col_double(),
x1StarReviews = col_double(),
PositiveServiceReview = col_double(),
NegativeServiceReview = col_double(),
Recommendproduct = col_double(),
BestSellersRank = col_double(),
ShippingWeight = col_double(),
ProductDepth = col_double(),
ProductWidth = col_double(),
ProductHeight = col_double(),
ProfitMargin = col_double(),
Volume = col_double()
)
# one-hot encode (dummify) the data
df_validation_preprocessed <- dummyVars(" ~.",data = df_validation)
df_validation_preprocessed <- data.frame(predict(df_validation_preprocessed,newdata = df_validation))
#drop the NA-containing column and the collinear/suspect review columns, matching the training preprocessing
drops <- c("BestSellersRank","x1StarReviews","x5StarReviews")
df_validation_preprocessed <- df_validation_preprocessed[,!(names(df_validation_preprocessed) %in% drops)]
#Make predictions with the model selected above (random forest)
model <- model_rf
df_train_test = df_validation_preprocessed
prediction <- predict(model, df_train_test)
prediction_validation <- prediction
prediction_validation_Volume <- 10^(prediction_validation)
#Add predictions to df
df_validation_w_predictions <- df_validation
df_validation_w_predictions['Predicted_Volume'] <- prediction_validation_Volume
#sort the df
df_validation_w_predictions <- df_validation_w_predictions[order(df_validation_w_predictions$Predicted_Volume),]
#Add unique ID column
df_validation_w_predictions['ProductType_ProductNumber_Price']<- with(df_validation_w_predictions, paste0(ProductType,'_#', ProductNum,'_$', Price))
par(mar=c(11,4,1,1))
barplot(height = df_validation_w_predictions$Predicted_Volume, names.arg = df_validation_w_predictions$ProductType_ProductNumber_Price, las=2, cex.axis = .8 , cex.names = 0.8, ylab = 'Volume')
#aggregate by Product Type
df_ProductType_aggregate <- aggregate(df_validation_w_predictions$Predicted_Volume, by=list(Category=df_validation_w_predictions$ProductType), FUN=sum)
colnames(df_ProductType_aggregate) <- c("ProductType", "Total_Predicted_Volume")
# sort the aggregate
df_ProductType_aggregate <- df_ProductType_aggregate[order(df_ProductType_aggregate$Total_Predicted_Volume),]
par(mar=c(9,4,1,4))
barplot(height = df_ProductType_aggregate$Total_Predicted_Volume, names.arg = df_ProductType_aggregate$ProductType, las=2, ylab = 'Total Volume')
#Plot Ratings and Reviews vs. Volume
x<-log10(df_validation_w_predictions$x4StarReviews)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='red',xlab = "log10(# of ratings)", ylab = "log10(Predicted_Volume)")
x<-log10(df_validation_w_predictions$x3StarReviews)
points(x,y,col='green')
x<-log10(df_validation_w_predictions$x2StarReviews)
points(x,y,col='blue')
legend(2.1,2.2, legend = list('4 Stars','3 Stars','2 Stars'),col=c('red','green','blue'),pch='o')
#Plot service reviews
x<-log10(df_validation_w_predictions$PositiveServiceReview)
y<-log10(df_validation_w_predictions$Predicted_Volume)
plot(x,y,col='green',xlab = "log10(# of Service Reviews)", ylab = "log10(Predicted_Volume)")
x<-log10(df_validation_w_predictions$NegativeServiceReview)
points(x,y,col='red')
legend(2, legend = list('Positive','Negative'),col=c('green','red'),pch='o')
Oftentimes it is helpful for report building to output your data set and predictions from RStudio. Let’s add your predictions to the new products data and then create a csv file. Use your csv and Excel to organize your data for reporting.
This was done in the previous part
write.csv(df_validation_w_predictions, file="C2.T3output.csv", row.names = FALSE)
We’ll just stick to organizing the data in R
Write an informal report to Danielle Sherman, in Word or PowerPoint, describing your analysis. In addition to presenting your findings, you might address questions such as the following:
In this report we review the multiple regression techniques used to predict sales volumes for new Blackwell Electronics products. In developing these models, we performed a number of preprocessing steps: eliminating features with high collinearity, scaling the data so all features had similar numeric ranges, one-hot encoding (dummifying) categorical string data so the models could leverage these categorical features, and transforming the label of interest, Volume, to a log scale to prevent the models from ever predicting volumes below zero.
The “existingproductattributes2017.csv” data set was used to build the multiple regression models. After pulling the data into R, the categorical feature, Product Type, was one-hot encoded (dummified). In this step, the “dummyVars” function determines the number of product-type categories, n_cat, and creates n_cat new columns, one per product type. Each cell in these new columns is populated with a one if that row’s data corresponds to the column’s product type, and a zero otherwise. In this way, we transformed string-based categorical data into numeric data, which the machine learning algorithms can leverage as features during training and prediction.
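As a minimal sketch of what the dummification does (a hypothetical toy data frame; base R’s model.matrix is used here to show the same idea dummyVars implements):

```r
# Toy illustration of one-hot encoding; the values are hypothetical
df <- data.frame(ProductType = c("PC", "Laptop", "PC", "Netbook"))

# "~ ProductType - 1" drops the intercept so each category gets its own 0/1 column
onehot <- model.matrix(~ ProductType - 1, data = df)
print(onehot)  # columns: ProductTypeLaptop, ProductTypeNetbook, ProductTypePC
```

Each row has exactly one 1 across the new columns, marking its original product type.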
Following the one-hot encoding, we checked the data for columns containing “NA” values. “BestSellersRank” was observed to contain 15 NA cells, so this column was dropped from the feature set.
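The NA check can be sketched as follows (hypothetical mini data frame; the real data had 15 NA cells in “BestSellersRank”):

```r
# Count NA cells per column, then drop any column containing NAs
df <- data.frame(Price = c(100, 200, 300),
                 BestSellersRank = c(1, NA, 3))

na_counts <- colSums(is.na(df))
print(na_counts)

df_clean <- df[, na_counts == 0, drop = FALSE]  # keeps only NA-free columns
```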
Next, we analyzed the correlations in the data set using a correlation plot, shown below.
Here, the deep blue cells represent strong positive correlations, while the deep red cells represent strong negative correlations. Using the correlation values from this table, we filtered out features with collinearities >0.95 (“x1StarReviews” with “x2StarReviews”). Furthermore, we discovered the “x5StarReviews” feature had a perfect correlation of 1 with the “Volume” label. This is a suspiciously good correlation between a feature and the label, so we examined the data in a scatter plot, shown below.
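The collinearity filter can be sketched in base R (synthetic data, assumed purely for illustration; caret’s findCorrelation offers a packaged version of the same idea):

```r
# Flag feature pairs whose absolute correlation exceeds the 0.95 cutoff
set.seed(42)
x1 <- rnorm(100)
df <- data.frame(x1 = x1,
                 x2 = 3 * x1 + rnorm(100, sd = 0.01),  # nearly a copy of x1
                 x3 = rnorm(100))                      # independent noise

corr <- cor(df)
high <- which(abs(corr) > 0.95 & upper.tri(corr), arr.ind = TRUE)
print(high)  # row/col indices of the one collinear pair (x1, x2)
```

One member of each flagged pair would then be dropped, as was done with “x1StarReviews”.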
As can be seen, this feature and our label of interest have a perfect correlation, which implies there is likely a data entry error in the “x5StarReviews” feature. For this reason, the column was dropped from the feature set.
Following the exclusion of the “x1StarReviews” and “x5StarReviews” features, we transformed the label column (“Volume”) to a log10 scale. This prevents any of the models from ever predicting negative volumes, since a negative prediction for log10(Volume) simply corresponds to a volume <1 (e.g., 10^-1 = 0.1).
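A one-line illustration of why the back-transformed predictions can never be negative (hypothetical log10-scale outputs):

```r
# Any real-valued prediction p on the log10 scale back-transforms to 10^p > 0
p <- c(-1, 0, 2.5)   # hypothetical model outputs
volume <- 10^p
print(volume)        # all strictly positive; 10^-1 is 0.1, not a negative volume
```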
Finally, the data was split into a training and testing set using a 75-25% train-test partition.
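The partition can be sketched as follows (base R sampling shown; whether caret::createDataPartition or a plain random sample was used is an implementation detail):

```r
# 75/25 train-test split on a hypothetical 100-row data frame
set.seed(123)
df <- data.frame(x = 1:100, y = rnorm(100))

train_idx <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
df_train <- df[train_idx, ]
df_test  <- df[-train_idx, ]   # the held-out 25%
```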
Three models were evaluated: (1) support vector machine (SVM) with linear kernel, (2) random forest (RF), and (3) eXtreme gradient boosted tree (xgbTree). From each model, the train and test RMSE and R-squared were calculated. The table below shows the summary of the results.
“Train Test Metrics”
To visualize the results, we also plotted the true log10(Volume) label vs. the log10(Volume) prediction. Viewing the RMSE & R-squared summary table, along with the label vs. prediction plots, we can see that the RF and xgbTree are the best models. However, comparing these two models, we also see that the xgbTree has an R-squared nearly equal to one (0.9999) on the training set, and its label vs. prediction trend there is almost perfectly linear. These two facts suggest the model is overfitting the training data, and thus the RF is the better model for generalization.
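The overfitting check can be made concrete by comparing the train-test R-squared gap, using the numbers from the model comparison output:

```r
# R-squared values taken from the model comparison summary
r2_train <- c(svmLinear = 0.820, rf = 0.989, xgbTree = 0.99997)
r2_test  <- c(svmLinear = 0.102, rf = 0.893, xgbTree = 0.888)

gap <- r2_train - r2_test   # a large gap means the model memorized the training set
print(round(gap, 3))        # rf shows the smallest train-test gap
```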
Using the trained RF model, we performed predictions for the new products defined in the “newproductattributes2017.csv” data set. Prior to feeding the data into the trained model, we carried out the previously described preprocessing steps (one-hot encoding the product type and dropping the “BestSellersRank”, “x1StarReviews”, and “x5StarReviews” columns). After predicting log10(Volume) for each case in the new product attributes table, we back-transformed the predictions (Volume = 10^prediction) to obtain the predicted volume. The bar chart below shows the breakdown of the total (aggregate) predicted volume vs. Product Type.
Here, we can see Tablets and Game Consoles are expected to have the highest sales volumes. Diving deeper into the data, we can break the products down further by product type, product number, and price. The bar chart below shows this breakdown.
From this, we can more clearly see which unique products are expected to have the highest sales volumes. Specifically, we see Tablet #187, sold at $199, contributes the majority of the total volume sold by tablets, while Game Consoles #307 and #199 contribute nearly equally to the total game console volume sold.
These conclusions have two key business implications: (1) if the sales objective is to maximize sales volume with a minimal number of products, then focusing on Tablet #187 is the best course of action; (2) if the objective is to offer the widest range of product types while maximizing sales volume, then the team should focus on PC#17, Tablet #186, Smartphone #194, Netbook #180, Game Consoles #307 and/or #199, and Tablet #187.
Finally, the last result that may be of interest to the Sales team is the impact of customer ratings and service reviews on Volume. The scatter plot below shows the predicted volume vs. # of ratings for 4-star, 3-star, and 2-star ratings, all on a log-log scale.
Here, we can see that there is essentially a linear relationship between log10(# of ratings) and log10(predicted Volume).
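A linear trend on a log-log scale corresponds to a power-law relationship, Volume ~ a * (# of ratings)^b. This can be sketched with synthetic data (the exponent 1.2 below is assumed, purely for illustration):

```r
# Fit a power law by regressing log10(volume) on log10(reviews)
set.seed(7)
reviews <- 10^runif(50, 0, 3)                         # 1 to 1000 ratings
volume  <- 5 * reviews^1.2 * 10^rnorm(50, sd = 0.05)  # power law plus noise

fit <- lm(log10(volume) ~ log10(reviews))
print(coef(fit))  # the slope recovers the exponent (~1.2)
```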
Similar to the # of ratings, we also see a linear relationship between log10(Volume) and log10(# of service reviews), both positive and negative, as can be seen in the plot below.
Overall, I found this to be the most challenging of the tasks we have completed, largely because it required more individual exploration rather than following the plan of attack line by line. That said, I think this style of activity was very educational. The part I had the most trouble with was running the initial linear model, as the errors the model was throwing were somewhat strange and there wasn’t a consistent answer online as to what they actually mean. Other than that, I found it pretty straightforward to rerun predictions using different models, though I wish R had better “function” capabilities, more similar to Python, because I found myself copying and pasting lines of code rather than dealing with the unique characteristics of R functions.